
YouTube Project: Analyzing freeCodeCamp Comments

Overview

This project focuses on analyzing comments from freeCodeCamp's YouTube channel to uncover insights about viewer sentiment and key topics of discussion. By leveraging the YouTube Data API, I gathered video and comment data, which was then preprocessed to remove noise such as punctuation, stopwords, and emojis. I used sentiment analysis with a BERTweet model to classify the sentiment of each comment, and applied BERTopic for topic modeling to identify recurring themes across the comments.

The project aims to provide a deeper understanding of how viewers interact with freeCodeCamp content, what topics are of interest to them, and how they feel about the videos. This aligns with the concept of social listening, which involves monitoring and analyzing online conversations to gain insights into public opinion, identify trends, and understand the emotional tone surrounding specific topics or brands.

What is Social Listening?

Social listening refers to the process of monitoring digital conversations to understand what is being said about a brand, product, or topic on social media platforms and other online spaces. It involves gathering data from sources like social media comments, blog posts, and forums, and analyzing it to gain actionable insights. Social listening can help identify consumer sentiments, emerging trends, and feedback that can inform marketing strategies, product development, and customer engagement efforts.

In the context of this project, social listening is applied by analyzing YouTube comments on freeCodeCamp’s videos. The sentiment and topic modeling results provide valuable feedback on how the audience is responding to the content. By understanding these dynamics, I can assess the effectiveness of freeCodeCamp's educational videos and recognize areas for improvement or growth.

It is important to note that this analysis is based on a small sample of data due to API limits. The data collected represents only a subset of the videos and comments from freeCodeCamp’s YouTube channel. This sample was extracted to demonstrate the process, but a more comprehensive analysis could be performed with broader data access.

In [ ]:
#Import modules
from googleapiclient.errors import HttpError
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import emoji
from transformers import pipeline
from bertopic import BERTopic
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import string
from PIL import Image
from googleapiclient.discovery import build
import time
import random
import plotly.express as px
from collections import Counter
from IPython.display import display, Markdown, HTML
import plotly.io as pio

pio.renderers.default = "notebook"

# Download NLTK resources (run only once)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

Fetch data from API

In [ ]:
# Set up YouTube API credentials
api_key = ""
youtube = build("youtube", "v3", developerKey=api_key)

# Define the channel ID for freeCodeCamp
channel_id = "UC8butISFwT-Wl7EV0hUK0BQ"  # freeCodeCamp.org YouTube Channel ID

# Function to fetch comments for a single video
def fetch_comments(video_id, max_comments=500):
    comments = []
    next_page_token = None
    comment_count = 0

    while comment_count < max_comments:
        try:
            request = youtube.commentThreads().list(
                part="snippet", videoId=video_id, textFormat="plainText", maxResults=100, pageToken=next_page_token
            )
            response = request.execute()

            for item in response['items']:
                comment_data = {
                    'Comment': item['snippet']['topLevelComment']['snippet']['textDisplay'],
                    'Author': item['snippet']['topLevelComment']['snippet']['authorDisplayName'],
                    'Comment ID': item['snippet']['topLevelComment']['id'],
                    'Published At': item['snippet']['topLevelComment']['snippet']['publishedAt']
                }
                comments.append(comment_data)
                comment_count += 1
                if comment_count >= max_comments:
                    break

            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break

            time.sleep(random.uniform(1, 3))  # Add random delay to reduce API load

        except HttpError as e:
            if e.resp.status == 403:  # Quota exceeded error
                print("Quota exceeded while fetching comments. Stopping the data collection.")
                return pd.DataFrame(comments)  # Return whatever has been collected so far
            else:
                print(f"Error fetching comments for video {video_id}: {e}")
                time.sleep(10)  # If an error occurs, wait before retrying

    return pd.DataFrame(comments)

def fetch_videos(channel_id, max_videos=500):
    videos = []
    request = youtube.channels().list(part="contentDetails", id=channel_id)
    response = request.execute()

    playlist_id = response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    next_page_token = None
    video_count = 0

    while video_count < max_videos:
        try:
            request = youtube.playlistItems().list(
                part="snippet", playlistId=playlist_id, maxResults=50, pageToken=next_page_token
            )
            response = request.execute()

            for item in response['items']:
                video_data = {
                    'Video Title': item['snippet']['title'],
                    'Video ID': item['snippet']['resourceId']['videoId'],
                    'Published At': item['snippet']['publishedAt'],
                    'Description': item['snippet']['description'],
                    'Video URL': f"https://www.youtube.com/watch?v={item['snippet']['resourceId']['videoId']}"
                }
                videos.append(video_data)
                video_count += 1
                if video_count >= max_videos:
                    break

            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break

            time.sleep(random.uniform(1, 3))  # Avoid exceeding rate limits

        except HttpError as e:
            if e.resp.status == 403:  # Quota exceeded error
                print("Quota exceeded while fetching videos. Stopping the data collection.")
                return pd.DataFrame(videos)  # Return whatever has been collected so far
            else:
                print(f"Error fetching videos: {e}")
                time.sleep(10)  # If an error occurs, wait before retrying

    return pd.DataFrame(videos)

# Collect video data (limit to 500 videos for quota safety)
videos_df = fetch_videos(channel_id, max_videos=500)
print(f"Collected {len(videos_df)} videos.")

# Collect comments for each video (limit to 500 comments per video)
comments_df = pd.DataFrame(columns=['Comment', 'Author', 'Comment ID', 'Published At', 'Video ID'])
for video_id in videos_df['Video ID']:
    print(f"Fetching comments for video {video_id}")
    video_comments_df = fetch_comments(video_id, max_comments=500)

    if not video_comments_df.empty:
        video_comments_df['Video ID'] = video_id
        comments_df = pd.concat([comments_df, video_comments_df], ignore_index=True)

    # Save collected data at each step to avoid losing progress
    videos_df.to_csv("freecodecamp_videos.csv", index=False)
    comments_df.to_csv("freecodecamp_comments.csv", index=False)

    # An empty result is treated as quota exhaustion, so stop fetching.
    # Note: this heuristic also stops at videos with zero or disabled comments.
    if video_comments_df.empty:
        break

print(f"Collected {len(comments_df)} comments.")

Data Cleaning

In [ ]:
comments_df = pd.read_csv('freecodecamp_comments.csv')
In [ ]:
videos_df = pd.read_csv('freecodecamp_videos.csv')
In [ ]:
comments_df.head()
Out[ ]:
Comment Author Comment ID Published At Video ID
0 0:00 Just watching the video to learn english:... @ottomuller6222 UgxISJ37o8HPk4i1OXZ4AaABAg 2025-03-20T15:08:01Z wuVVclLcjuA
1 COBOL @justwanderin847 Ugwpn4yQ0SAXS6ntXDt4AaABAg 2025-03-20T14:28:05Z wuVVclLcjuA
2 you know what’s wild? that somebody has to exp... @thghtfl UgzAfOj88p7zjs4XNoV4AaABAg 2025-03-20T13:17:16Z wuVVclLcjuA
3 I really like the advanced voice mode of the C... @Ph34rNoB33r UgxrQ_ykBj_mr36ED8V4AaABAg 2025-03-20T12:34:33Z wuVVclLcjuA
4 Well done Boris, great accent for a teacher. @evgenymagidson4434 UgweXFBBkw4B6Aru9Ax4AaABAg 2025-03-20T11:54:53Z wuVVclLcjuA
In [ ]:
videos_df.head()
Out[ ]:
Video Title Video ID Published At Description Video URL
0 Learn ANY Language with AI (Learn English, Lea... wuVVclLcjuA 2025-03-19T15:53:38Z Discover how to master any language, including... https://www.youtube.com/watch?v=wuVVclLcjuA
1 Build a Full Stack AI Note Taking App with Nex... 6ChzCaljcaI 2025-03-18T15:12:29Z Build a full-stack note-taking app with the Ne... https://www.youtube.com/watch?v=6ChzCaljcaI
2 How to become a self-taught developer while su... 28c0QMQZ5yA 2025-03-14T22:04:40Z On this week's episode of the podcast, freeCod... https://www.youtube.com/watch?v=28c0QMQZ5yA
3 AWS Cognito Course – Authentication and Author... ajExOgOCJXY 2025-03-13T14:37:57Z This comprehensive AWS Cognito course covering... https://www.youtube.com/watch?v=ajExOgOCJXY
4 JavaScript Essentials Course 876aSEUA_8c 2025-03-12T18:06:03Z Learn JavaScript essentials with this course t... https://www.youtube.com/watch?v=876aSEUA_8c
In [ ]:
stop_words = set(stopwords.words('english'))
In [ ]:
def preprocess_text(text):
  """
  Preprocesses text by removing punctuation, converting to lowercase, and removing stopwords.

  Args:
      text (str): The text to preprocess.

  Returns:
      str: The preprocessed text.
  """
  if pd.isna(text):
    return None
  else:
    # Remove punctuation and convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Remove stopwords
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text


# Apply preprocessing to the 'Comment' column
comments_df['Processed_Text'] = comments_df['Comment'].apply(preprocess_text)
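As a quick sanity check, the preprocessing can be exercised on a sample comment. The snippet below mirrors `preprocess_text` but inlines a tiny illustrative stopword list so it runs without NLTK (the notebook itself uses the full `stopwords.words('english')` list):

```python
import string

# Tiny inline stopword list standing in for nltk's English stopwords
# (an illustrative subset, not the full list).
stop_words_demo = {"the", "is", "a", "to", "for", "this", "i", "it"}

def preprocess_text_demo(text: str) -> str:
    # Strip punctuation, lowercase, then drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return " ".join(w for w in text.split() if w not in stop_words_demo)

print(preprocess_text_demo("Thanks for this great tutorial, I loved it!"))
# -> thanks great tutorial loved
```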
In [ ]:
def translate_emojis(text):
  """
  Translates emojis in text to their textual descriptions.

  Args:
      text (str): The text containing emojis.

  Returns:
      str: The text with emojis translated.
  """
  if pd.isna(text):
    return None
  else:
    return emoji.demojize(text)
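To make the translation step concrete, here is what demojize-style translation does to a short comment. The mapping below is a tiny hand-rolled stand-in so the snippet runs without the emoji package; the real `emoji.demojize` uses the full Unicode emoji table with CLDR short names, producing codes like `:red_heart:` that appear later in the topic outputs.

```python
# Hand-rolled stand-in for emoji.demojize covering a few emojis
# (the real library maps the full Unicode emoji set to CLDR short names).
EMOJI_NAMES = {
    "❤": ":red_heart:",
    "🙏": ":folded_hands:",
    "😊": ":smiling_face_with_smiling_eyes:",
}

def demojize_lite(text: str) -> str:
    # Replace each known emoji character with its textual short code.
    return "".join(EMOJI_NAMES.get(ch, ch) for ch in text)

print(demojize_lite("thank you 🙏"))  # -> thank you :folded_hands:
```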
In [ ]:
# Apply demojize to the entire 'Comment' column
comments_df['Comment_no_emojis'] = comments_df['Processed_Text'].apply(lambda text: translate_emojis(text))

Missing values

In [ ]:
videos_df.isnull().sum()
Out[ ]:
0
Video Title 0
Video ID 0
Published At 0
Description 12
Video URL 0

In [ ]:
comments_df.isnull().sum()
Out[ ]:
0
Comment 35
Author 45
Comment ID 0
Published At 0
Video ID 0
Processed_Text 35
Comment_no_emojis 35

In [ ]:
# Remove rows with any missing values (missing comments and missing author names)
comments_df = comments_df.dropna()
In [ ]:
#Extract date
comments_df['Published_date'] = pd.to_datetime(comments_df['Published At']).dt.date
videos_df['Published_date'] = pd.to_datetime(videos_df['Published At']).dt.date

Descriptive Statistics

In [ ]:
plt.figure(figsize=(12, 6))
sns.histplot(comments_df['Published_date'], bins=50, kde=True)
plt.xlabel("Published Date")
plt.ylabel("Number of Comments")
plt.title("Distribution of YouTube Comments Over Time")
plt.xticks(rotation=45)
plt.show()
In [ ]:
plt.figure(figsize=(12, 6))
sns.histplot(comments_df.Comment.apply(len), bins=200, kde=True)
plt.xlabel("Character Length")
plt.ylabel("Number of Comments")
plt.title("Distribution of Comment Character Length")
plt.xticks(rotation=45)
plt.show()
In [ ]:
comments_df.Comment.apply(len).describe()  # Comment length
Out[ ]:
Comment
count 59436.000000
mean 118.133589
std 344.995777
min 1.000000
25% 25.000000
50% 56.000000
75% 117.000000
max 9993.000000

In [ ]:
def wc_preprocess_text(text):
  # Remove punctuation and convert to lower case
  text = text.translate(str.maketrans("", "", string.punctuation)).lower()
  # Initialize lemmatizer
  lemmatizer = WordNetLemmatizer()

  # Tokenize the text (already lowercased above)
  tokens = word_tokenize(text)

  # Lemmatize the tokens, skipping purely numeric tokens
  tokens = [lemmatizer.lemmatize(word) for word in tokens if not word.isnumeric()]
  return tokens
In [ ]:
comments_df['processed_text_tokens'] = comments_df['Processed_Text'].apply(wc_preprocess_text)
In [ ]:
# Flatten the list of tokens
all_words = [word for tokens in comments_df['processed_text_tokens'] for word in tokens]

# Count word occurrences
word_counts = Counter(all_words)

# Get the top 10 most common words
top_10_words = word_counts.most_common(10)
In [ ]:
display(Markdown(f"##### Key Metrics"))

display(Markdown(f"* Total Comments Analyzed: {len(comments_df)} for {len(videos_df)} videos, which is equal to around {round(len(comments_df)/len(videos_df),2)} comments per video"))
display(Markdown(f"*  Date Range of Comments: {comments_df['Published_date'].min()} to {comments_df['Published_date'].max()}"))
display(Markdown(f"* Unique Users: {len(comments_df['Author'].unique())}"))
display(Markdown(f"* Average Comment Length: {comments_df.Comment.apply(len).mean().astype(int)} characters"))
display(Markdown(f"* Most Frequent Words:"))
display(pd.DataFrame(top_10_words,columns=['Word','Count']))
Key Metrics
  • Total Comments Analyzed: 59436 for 500 videos, which is equal to around 118.87 comments per video
  • Date Range of Comments: 2022-07-21 to 2025-03-21
  • Unique Users: 47864
  • Average Comment Length: 118 characters
  • Most Frequent Words:
Word Count
0 video 7408
1 course 7203
2 thank 5736
3 thanks 4776
4 like 4491
5 code 3814
6 tutorial 3726
7 great 3374
8 please 3250
9 much 3194
Conclusion

The descriptive statistics provide a quantitative overview of the dataset, highlighting engagement levels and common patterns in comment length and vocabulary usage.

Word Cloud

In [ ]:
def bag_of_words_tokens_to_string(tokens):
  # Flatten the list and join items with a space
  flattened_string = ' '.join([item for sublist in tokens for item in sublist])
  return flattened_string
In [ ]:
comment_mask = np.array(Image.open('./comment_mask.png'))
In [ ]:
def plot_bag_of_words(data=comments_df, sentiment='ALL', colormap='BuPu_r'):
  # Generate the word cloud
  if sentiment == 'ALL':
    bag_of_words_string = bag_of_words_tokens_to_string(list(data['processed_text_tokens']))
  else:
    sent_tokens_dict = {'POS': list(data[data['Sentiment_label']=='POS']['processed_text_tokens']),
                        'NEG': list(data[data['Sentiment_label']=='NEG']['processed_text_tokens']),
                        'NEU': list(data[data['Sentiment_label']=='NEU']['processed_text_tokens'])}
    bag_of_words_string = bag_of_words_tokens_to_string(sent_tokens_dict[sentiment])
  wordcloud = WordCloud(background_color = 'white', mask = comment_mask, contour_width = 2,
      contour_color = 'black', colormap = colormap, width = 800, height = 500).generate(bag_of_words_string)
  plt.imshow(wordcloud)
  plt.axis("off")
  return wordcloud
In [ ]:
all_cloud = plot_bag_of_words()
In [ ]:
all_cloud.to_file("wordcloud_all.png")
Out[ ]:
<wordcloud.wordcloud.WordCloud at 0x7c7c67335490>

Sentiment analysis

Using the bertweet-base-sentiment-analysis model, each comment is classified as positive (POS), negative (NEG), or neutral (NEU) (Pérez et al., 2021).

Why Use This Model?

  • Trained on social media data, so it handles slang, emojis, and informal text.
  • Easy to use with Hugging Face Pipelines.
In [ ]:
sentiment_pipeline = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
Device set to use cpu

Handling Emojis in Sentiment Analysis

Emojis can impact sentiment classification. Let's compare how the model classifies text with emojis vs. after emoji translation.

In [ ]:
def detect_emojis(text):
  emoji_list = [char for char in text if emoji.is_emoji(char)]
  return len(emoji_list)>0
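For reference, `emoji.is_emoji` consults the full Unicode emoji table; a rough stdlib-only approximation (an illustration, not a replacement) checks the main emoji code-point blocks:

```python
def looks_like_emoji(ch: str) -> bool:
    # Rough check against the main emoji code-point blocks
    # (symbols and pictographs, plus misc symbols/dingbats).
    code = ord(ch)
    return 0x1F300 <= code <= 0x1FAFF or 0x2600 <= code <= 0x27BF

def detect_emojis_lite(text: str) -> bool:
    # True if the text contains at least one (approximately detected) emoji.
    return any(looks_like_emoji(ch) for ch in text)

print(detect_emojis_lite("Thanks 🙏"), detect_emojis_lite("Thanks a lot"))  # -> True False
```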
In [ ]:
text_with_emoji_list = [text for text in comments_df['Processed_Text'].astype(str) if detect_emojis(text)]
print(f'{len(text_with_emoji_list)}({round(len(text_with_emoji_list)/len(comments_df),2)*100}%) comments contain at least one emoji')
9328(16.0%) comments contain at least one emoji
In [ ]:
sentiment_without_emoji = []
sentiment_with_emoji = []
comment = []
for idx,row in comments_df.iterrows():
  text_with_emojis = row['Processed_Text']
  text_without_emojis = row['Comment_no_emojis']
  if detect_emojis(text_with_emojis):
    sent_without = sentiment_pipeline([text_without_emojis],truncation=True)
    sent_with = sentiment_pipeline([text_with_emojis],truncation=True)
    sentiment_without_emoji.append(sent_without[0]['label'])
    sentiment_with_emoji.append(sent_with[0]['label'])
    comment.append(row['Comment'])
In [ ]:
emoji_sent_comapre_df = pd.DataFrame({'Comment':comment,'Sentiment_without_emoji':sentiment_without_emoji,'Sentiment_with_emoji':sentiment_with_emoji})
In [ ]:
#Plot the sentiment distributions for texts with emojis versus those without emojis.
plt.figure(figsize=(10,5))
sns.histplot(emoji_sent_comapre_df["Sentiment_with_emoji"], label="With Emoji", color="blue", alpha=0.6)
sns.histplot(emoji_sent_comapre_df["Sentiment_without_emoji"], label="Translated", color="red", alpha=0.6)
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution: Emoji vs. Translated")
plt.legend()
plt.show()
In [ ]:
# Get a clear view of the sentiment transitions between with and without emojis.
# trans_value[a][b] counts comments labeled `a` with emojis kept and `b` after translation;
# e.g. trans_value['POS']['NEG'] counts positive-with-emoji comments that became negative.
trans_value = {
    'NEU': {'POS': 0, 'NEG': 0},
    'POS': {'NEG': 0, 'NEU': 0},
    'NEG': {'POS': 0, 'NEU': 0}
}

for idx, row in emoji_sent_comapre_df.iterrows():
  if row['Sentiment_with_emoji'] != row['Sentiment_without_emoji']:
    trans_value[row['Sentiment_with_emoji']][row['Sentiment_without_emoji']] += 1

trans_value
Out[ ]:
{'NEU': {'POS': 1351, 'NEG': 201},
 'POS': {'NEG': 6, 'NEU': 59},
 'NEG': {'POS': 26, 'NEU': 157}}
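The same transition table can be produced directly with `pandas.crosstab` on the comparison DataFrame. A sketch on stand-in data (the real call would use the `emoji_sent_comapre_df` built above):

```python
import pandas as pd

# Stand-in for emoji_sent_comapre_df with a few hypothetical rows.
df = pd.DataFrame({
    "Sentiment_with_emoji":    ["NEU", "NEU", "POS", "NEG", "NEU"],
    "Sentiment_without_emoji": ["POS", "POS", "POS", "NEU", "NEG"],
})

# Keep only rows where translation changed the label, then cross-tabulate.
changed = df[df["Sentiment_with_emoji"] != df["Sentiment_without_emoji"]]
transitions = pd.crosstab(changed["Sentiment_with_emoji"],
                          changed["Sentiment_without_emoji"])
print(transitions)
```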
In [ ]:
# Let's look at some examples of sentiment transitions
emoji_non_emoji_sentiment_change = emoji_sent_comapre_df[emoji_sent_comapre_df['Sentiment_without_emoji']!=emoji_sent_comapre_df['Sentiment_with_emoji']].sample(n=20)
for idx,row in emoji_non_emoji_sentiment_change.iterrows():
  print(f"Comment {idx}: {row['Comment']}")
  print(f'Sentiment without emoji: {row["Sentiment_without_emoji"]}')
  print(f'Sentiment with emoji: {row["Sentiment_with_emoji"]}')
Comment 1456: ❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 3511: I woke up 😊
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 6159: We need Scapy Course Aloso😢
Sentiment without emoji: NEG
Sentiment with emoji: NEU
Comment 5873: I tried for as long as I could, but whoever wrote the script that he’s reading, come on, please. That was torturous. Drinking game: Take a shot every time he says "database management system." Don't...you will die.

"UX [pause] user experience" 😄
Sentiment without emoji: NEU
Sentiment with emoji: NEG
Comment 1399: Thank you 🙏
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 9130: I appreciate the efforts put in this, but it's not a course. It's an audio book 😥
Sentiment without emoji: NEU
Sentiment with emoji: POS
Comment 4586: We need backend roadmap .Who agree with me ❤?
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 1629: The joy of being the first to comment 😂❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 2510: Support unlimited ❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 4164: You have won me at your indian accent ❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 1987: 16:42 It is 56 😊
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 4762: Most revolutionary framework ever❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 8473: .........wow 😊.....thanks✋🙏🙌👏
Sentiment without emoji: NEU
Sentiment with emoji: NEG
Comment 4732: 😍
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 173: Memories 😢
Sentiment without emoji: NEG
Sentiment with emoji: NEU
Comment 482: 👍
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 4897: ❤
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 7502: Thank you bro ♥
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 3555: 🔥😍 thank you all
Sentiment without emoji: POS
Sentiment with emoji: NEU
Comment 459: Thanks for this ☺️
Sentiment without emoji: POS
Sentiment with emoji: NEU

A great example of why translating emojis instead of leaving them as-is (or removing them) is beneficial can be seen in the following cases:

  • The comment "Deepseek gonna takeover everything soon 😢" transitioned from Neutral to Negative after the emoji was translated.

  • Similarly, "Hi, I started this course about a week ago and I am having a bit of trouble with the SQL intermediate phase and would really appreciate some assistance 😢" shifted from Neutral to Negative once the emoji was translated.

  • Some comments containing only an emoji, like "❤," seemed to transition from Neutral to Positive/Negative after the emoji was translated.

It appears that translating the emojis improved performance, with most changes moving from Neutral to either Positive or Negative. This suggests that emoji translation helps the model pick up the emotional cues that emojis carry.

In [ ]:
non_emoji_docs = list(comments_df.Comment_no_emojis)
In [ ]:
# Long texts will be truncated
comments_df['Sentiment'] = sentiment_pipeline(non_emoji_docs,truncation=True)
In [ ]:
comments_df['Sentiment_label'] = comments_df['Sentiment'].apply(lambda x: x['label'])
comments_df['Sentiment_score'] = comments_df['Sentiment'].apply(lambda x: x['score'])
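The pipeline returns one dict per input with `label` and `score` keys, so extracting them is a plain dict lookup. A sketch on stand-in output (running the actual model is expensive):

```python
import pandas as pd

# Stand-in for the pipeline output: one {'label', 'score'} dict per comment.
sentiments = [
    {"label": "POS", "score": 0.98},
    {"label": "NEU", "score": 0.71},
]
df = pd.DataFrame({"Sentiment": sentiments})

# Unpack the dicts into separate label and score columns.
df["Sentiment_label"] = df["Sentiment"].apply(lambda x: x["label"])
df["Sentiment_score"] = df["Sentiment"].apply(lambda x: x["score"])
print(list(df["Sentiment_label"]))  # -> ['POS', 'NEU']
```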
In [ ]:
pio.renderers.default = "notebook"

sent_freq = comments_df['Sentiment_label'].value_counts().reset_index()
sent_freq['p'] = sent_freq['count']/sent_freq['count'].sum()

fig = px.pie(sent_freq, values='p', names='Sentiment_label', title='Percentage of comments by sentiment category')
fig.show()

Let's look at some examples of each sentiment

In [ ]:
# Map each sentiment label to a word cloud colormap
sentiments = {'POS':'Greens', 'NEG':'Reds', 'NEU':'spring'}

for sentiment,colormap in sentiments.items():
  display(Markdown(f"## {sentiment}: Example documents"))
  random_examples = comments_df[comments_df['Sentiment_label']==sentiment].sample(n=10)['Comment'].values
  for idx,comment in enumerate(random_examples):
    print(f'Example {idx}: {comment}')
  display(Markdown(f"## {sentiment}: Word cloud"))
  cloud = plot_bag_of_words(sentiment=sentiment,colormap=colormap)
  cloud.to_file(f"wordcloud_{sentiment}.png")

POS: Example documents

Example 0: We appreciate your efforts FCC🎉.
Example 1: bless dis boy
Example 2: I liked the proc debugging 3:53:00
Example 3: Simply Awesome!
Example 4: This channel is the best channel for programmers on YouTube, Thank you for making this Channel you guys are legends ❤
Example 5: Thx 🙏🏾 ☺️
Example 6: Thank you ❤
Example 7: I graduated may 2023 with a masters in CS and im still applying to jobs. I luckily have been able to keep my student position which pays the bills, but i really just want to get out and develop. Thanks for this course, looking forward to getting through it.
Example 8: Please make same video for ai dev road map 2024 it will be great
Example 9: Playback speed 0.5 absolutely necessary for this video

POS: Word cloud

NEG: Example documents

Example 0: im so frustrated, cause the part where he scrapes the data into the SQL database isn't explained properly enough and that is generally my issue with trying to learn from @freeCodeCamp he made it seem like its two simple steps now ive been on the problem for the last one hour smh
Example 1: A readonly array doesn't allow you to push. 1:36:40
Example 2: Don't watch it just for power bi like me.....its worst
Example 3: Why are coders so socially awkward. The guy is a fucking robot. I would stop myself with getting that job and go next by seeing his robot expressions. Is that someone you want to have at your work? You're spending 40 hours a week at work, better have collegues who have at least some facial expression, emotion and a feeling of welcome.
Example 4: Ok stop spying on me FCC ! how do you know exactly what i've been looking up ?
Example 5: Great tutorial but the Spotify signup part is unbearable.
Example 6: It will take 1 Yr to go through this course and by then it might become obsolete or change substantially..
Example 7: There are flaws with some of those solutions. Like flagging an account as spam if it is follower by spam accounts. That could lead to valid accounts being flagged as spam or even attacked by bots intentionally adding users to flag them as spam. The thing about using email from random domains is also problematic in many ways, and also using emails that use random characters (some of us use random emails for different accounts precisely to keep spam away and improve privacy). You could also not even guess bot characteristics and feed data for models to try to find common characteristics and trends.
Example 8: Exposing S3 env variables to the client is bad! Please use an API instead or presigned URLs!
Example 9: 26:51 model: gpt-4 doesn't work I got 404 req. Solution ---> gpt-3.5-turbo

NEG: Word cloud

NEU: Example documents

Example 0: Is it for the absolute beginner?
Example 1: I follow you with great interest! but I want to ask you if there are big differences between the operating systems on which the game runs! for example, following a tutorial of a PC game, could I release it as an Android game?
Example 2: Shout out to Tim! The Eddie Murphy of music production 👏👏👏
Example 3: <3 the Solid snake
Example 4: One of the first things I learnt at school to do good exposé is to write as little text as possible, especially not the text you're already saying out loud, and this video is full of it :(  
Makes it harder to remember things
Example 5: Is C++ used?
Example 6: can you please answer me? do I have to mathematics to be able to work in AI ?
if yes what branches of Mathematics do I have to know?
like Linear Algebra, Probability?
Example 7: Has this been released? I still see version 18.x on React official site.
Example 8: Bro taught us entire 12th grade math in one video 💀
Example 9: Thanks!

NEU: Word cloud


Distribution of YouTube Comments Over Time Grouped by Sentiment

In [ ]:
plt.figure(figsize=(12, 6))
sns.histplot(data = comments_df,x='Published_date',hue='Sentiment_label', bins=50, kde=True)
plt.xlabel("Published Date")
plt.ylabel("Number of Comments")
plt.title("Distribution of YouTube Comments Over Time grouped by sentiment")
plt.xticks(rotation=45)
plt.show()
In [ ]:
display(Markdown(f"##### Key Metrics"))
for sentiment in sentiments.keys():
  sentiment_len = len(comments_df[comments_df['Sentiment_label']==sentiment])
  total_len = len(comments_df)
  display(Markdown(f"* {sentiment} Sentiments: {sentiment_len}({round(sentiment_len/total_len*100,2)} %) comments"))
Key Metrics
  • POS Sentiments: 20601(34.66 %) comments
  • NEG Sentiments: 7136(12.01 %) comments
  • NEU Sentiments: 31699(53.33 %) comments

Sentiment analysis Conclusion

The sentiment analysis shows that most comments are neutral, with positive sentiment comprising a substantial portion and negative sentiment highlighting areas for potential improvement.

Topic Modeling (BERTopic)

To extract meaningful themes, BERTopic was applied separately to each sentiment category:

  • Positive Comments: Identifying themes in praises, appreciation, and positive feedback.

  • Negative Comments: Highlighting criticisms, complaints, and areas for improvement.

  • Neutral Comments: Extracting general discussion topics without strong sentiment.

In [ ]:
sent_model_result = {'POS':{},
                     'NEG':{},
                     'NEU':{}}
for sentiment in ['POS','NEG','NEU']:
  print(f'Processing {sentiment} comments')
  print(f'Number of comments: {len(comments_df[comments_df.Sentiment_label==sentiment])}')
  non_emoji_docs_sent = list(comments_df.Comment_no_emojis[comments_df.Sentiment_label==sentiment].astype(str))
  topic_model = BERTopic(nr_topics=11)
  topics, probs = topic_model.fit_transform(non_emoji_docs_sent)
  sent_model_result[sentiment] = {'topics':topics,'probs':probs,'topic_model':topic_model}
Processing POS comments
Number of comments: 20601
Processing NEG comments
Number of comments: 7136
Processing NEU comments
Number of comments: 31699

For each sentiment:

  • After generating topics and their probabilities:

    • I can access the most frequent topics that were generated.

    • Investigate the relationships between topics.

Notes:

  • Topic -1 refers to all outliers and should typically be ignored.

  • I set nr_topics=11, which leaves 10 interpretable topics once the outlier topic (-1) is excluded; limiting the topic count improves interpretability and ensures meaningful topic extraction with the BERTopic model.
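Excluding the outlier topic before reporting frequencies is a simple filter on the `get_topic_info()` table, sketched here on a stand-in DataFrame shaped like BERTopic's output:

```python
import pandas as pd

# Stand-in for topic_model.get_topic_info(): Topic -1 is the outlier bucket.
topic_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2],
    "Count": [6781, 7068, 3204, 2921],
})

# Drop the outlier topic, then rank the remaining topics by frequency.
real_topics = (topic_info[topic_info["Topic"] != -1]
               .sort_values("Count", ascending=False)
               .reset_index(drop=True))
print(list(real_topics["Topic"]))  # -> [0, 1, 2]
```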

In [ ]:
# Define sentiments
sentiments = ['POS', 'NEG', 'NEU']
topic_summary_df = pd.DataFrame()
for sentiment in sentiments:
    display(Markdown(f"## {sentiment}: Sentiment Analysis"))

    topic_model = sent_model_result[sentiment]['topic_model']

    # Get sorted topics
    get_topic_info_sorted = topic_model.get_topic_info().sort_values(by='Count', ascending=False)
    sorted_topics = list(get_topic_info_sorted['Topic'])

    # Remove outlier topic (-1)
    if -1 in sorted_topics:
        sorted_topics.remove(-1)
    display(get_topic_info_sorted)  # Display DataFrame in Jupyter

    # Retrieve the top 2 topics based on their frequency (count)
    top_2_topics = get_topic_info_sorted[get_topic_info_sorted['Topic'].isin(sorted_topics)].reset_index(drop=True).loc[:1]
    # Get the top keywords representing the selected topics
    Top_Keywords = top_2_topics['Representation']
    # Calculate the percentage of comments for the selected topics
    Percentage_of_Comments = round(top_2_topics['Count']/get_topic_info_sorted['Count'].sum()*100,2)
    # Convert Percentage_of_Comments to string and add '%' suffix
    Percentage_of_Comments = Percentage_of_Comments.astype(str) + '%'
    # Create a temporary DataFrame for the sentiment with top keywords and percentage of comments
    sent_topic_summary_df = pd.DataFrame({'Sentiment':sentiment,'Top Keywords':Top_Keywords,'Percentage of Comments':Percentage_of_Comments})
    # Concatenate the temporary DataFrame to the main summary DataFrame
    topic_summary_df = pd.concat([topic_summary_df,sent_topic_summary_df])

    # Generate topic bar chart
    display(Markdown(f"### Top Words per Topic"))
    fig = topic_model.visualize_barchart(topics=sorted_topics)
    fig.show()

    # Generate topic visualization
    display(Markdown(f"### Topic Clustering Visualization"))
    try:
        fig = topic_model.visualize_topics()
        fig.write_html(f"{sentiment.lower()}_2d.html")
        fig.show()
    except Exception:
        print('Not enough topics for 2D visualization')

    # Generate heatmap
    display(Markdown(f"### Topic Similarity Heatmap"))
    try:
        fig = topic_model.visualize_heatmap()
        fig.show()
    except Exception:
        print('Not enough topics for heatmap')

POS: Sentiment Analysis

Topic Count Name Representation Representative_Docs
1 0 7068 0_course_thank_video_great [course, thank, video, great, thanks, tutorial... [great video thanks lot, thank great course, g...
0 -1 6781 -1_video_course_thank_thanks [video, course, thank, thanks, great, tutorial... [thank course want need learn github also styl...
2 1 3204 1_redheart_thank_thumbsup_partypopper [redheart, thank, thumbsup, partypopper, thank... [thank much :smiling_face_with_smiling_eyes:, ...
3 2 2921 2_thank_much_thanks_awesome [thank, much, thanks, awesome, nice, great, da... [thank much, thank much, thank much]
4 3 324 3_india_proud_love_ali [india, proud, love, ali, gracias, tunisiatuni... [love india, love india, love india :red_heart:]
5 4 162 4_flutter_mongodb_error_tutorial [flutter, mongodb, error, tutorial, laravel, a... [hello rivaan fellow mates implemented project...
6 5 73 5_lets_go_gooooo_gooo [lets, go, gooooo, gooo, goooo, 2025, still, g... [lets go, lets go, lets go]
7 6 23 6_shes_cute_smart_beautiful [shes, cute, smart, beautiful, camillashes, bl... [shes confident, wow shes cute, shes awesome]
8 7 21 7_gpu_ryzen_cpu_pc [gpu, ryzen, cpu, pc, intel, elliots, gpustat,... [hi freecodecamp usual nice video quick tip th...
9 8 13 8_quincy_keep_leon_really [quincy, keep, leon, really, johnny, groves, r... [wow quincy works fast, really hyped quincy, g...
10 9 11 9_rollingonthefloorlaughing_rollingonthefloorl... [rollingonthefloorlaughing, rollingonthefloorl... [341555 fffffine gotcha andrew:rolling_on_the_...

Top Words per Topic

Topic Clustering Visualization

Topic Similarity Heatmap

NEG: Sentiment Analysis

Topic Count Name Representation Representative_Docs
0 -1 2938 -1_dont_code_like_im [dont, code, like, im, get, error, cant, video... [im stuck since one week qovery part im angry ...
1 0 1184 0_video_accent_english_understand [video, accent, english, understand, videos, l... [dont understand indian accent, good video awf...
2 1 1142 1_error_getting_help_get [error, getting, help, get, cant, code, work, ... [one help getting import error app2 already in...
3 2 963 2_ai_dont_tutorial_people [ai, dont, tutorial, people, like, learn, job,... [ven 25yearolds cant find jobs days despite kn...
4 3 416 3_hours_time_waiting_waste [hours, time, waiting, waste, understand, damn... [waiting since long time, waiting long time, 5...
5 4 201 4_react_javascript_typescript_js [react, javascript, typescript, js, nextjs, us... [angular matured professional framework simple...
6 5 113 5_css_tailwind_mode_light [css, tailwind, mode, light, dark, html, class... [tailwind css styles didnt apply, thanks tutor...
7 6 101 6_spam_google_crash_crypto [spam, google, crash, crypto, cryptocurrency, ... [wait think cryptocurrency crash dont think im...
8 7 40 7_python_formatter_formatting_please [python, formatter, formatting, please, mentio... [formatting working follow process please upda...
9 8 28 8_racist_art_ai_artist [racist, art, ai, artist, steal, humans, skin,... [isnt ai kinda racist, racist ai, racist ai]
10 9 10 9_validation_dataset_loss_data [validation, dataset, loss, data, train, wrong... [see spare data 02 dataset used validation don...

Top Words per Topic

Topic Clustering Visualization

Topic Similarity Heatmap

NEU: Sentiment Analysis

Topic Count Name Representation Representative_Docs
0 -1 9887 -1_video_like_course_code [video, like, course, code, please, use, using... [appreciate work put definetly know talking th...
1 0 8393 0_course_video_code_ai [course, video, code, ai, using, please, aws, ... [cloud concepts 2424 cloud computing 2757 comm...
2 1 6114 1_thanks_thank_first_comment [thanks, thank, first, comment, facewithtearso... [thanks, thanks, thanks]
3 2 3626 2_day_hours_2024_time [day, hours, 2024, time, feb, 10, let, timesta... [day 1 14500 day 2 30000 day 3 50000 day 4 700...
4 3 1173 3_react_typescript_angular_npm [react, typescript, angular, npm, javascript, ... [typescript popular programming language based...
5 4 975 4_css_image_tailwind_html [css, image, tailwind, html, slides, animation... [please course css, anyone tell tailwind css, ...
6 5 911 5_error_data_file_sql [error, data, file, sql, get, help, anyone, ap... [sorry id like somebody explain consider using...
7 6 372 6_de_que_no_en [de, que, no, en, vai, nada, fazer, el, espaol... [dj gustavo não vai fazer nada pra ele não vai...
8 7 114 7_cannafarm_ltd_xai910k_stocks [cannafarm, ltd, xai910k, stocks, finance, str... [really appreciate clear simple breakdown fina...
9 8 111 8_tunisiatunisiatunisiatunisiatunisia_tunisiat... [tunisiatunisiatunisiatunisiatunisia, tunisiat... [big ali bouali :Tunisia::Tunisia:, thank sir ...
10 9 23 9_god_allah_love_jesus [god, allah, love, jesus, shall, worship, save... [gospel moreover brethren declare unto gospel ...

Top Words per Topic

Topic Clustering Visualization

Topic Similarity Heatmap

Topic Modeling Overview

Key Metrics: Top Topic Extracted from Each Sentiment

  • Positive Sentiment (POS):

    • Top Keywords: "course", "thank", "video", "great", "thanks"

    • Share of positive-sentiment comments in this topic: 34.31%

    • Dominated by expressions of thanks and appreciation, with some mentions of specific technical topics like "tutorial" and "react."

  • Negative Sentiment (NEG):

    • Top Keywords: "video", "accent", "english", "understand", "ads"

    • Share of negative-sentiment comments in this topic: 16.59%

    • Negative sentiments reflect issues with understanding accents, video quality, and ads.

  • Neutral Sentiment (NEU):

    • Top Keywords: "course", "video", "code", "ai", "please"

    • Share of neutral-sentiment comments in this topic: 26.48%

    • Neutral sentiments indicate general discussions of the course and coding, with users providing thanks and requests for clarification.
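The share figures above come from the summary cell earlier in the notebook; as a sanity check, they can be reproduced directly from the POS topic table. A minimal sketch, dividing the top topic's count by the total number of positive comments (the outlier topic -1 is included in the denominator, matching the notebook's computation):

```python
import pandas as pd

# Topic counts copied from the POS table above, keyed by topic id.
pos_counts = pd.Series(
    [6781, 7068, 3204, 2921, 324, 162, 73, 23, 21, 13, 11],
    index=[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
)

# Top topic (id 0) as a share of all positive comments, including outliers.
top_topic_share = round(pos_counts[0] / pos_counts.sum() * 100, 2)
print(f"{top_topic_share}%")  # → 34.31%
```

The same calculation against the NEG and NEU tables yields 16.59% and 26.48%, confirming the figures in the bullets above.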

Conclusion

The topic modeling results provide valuable insights into audience sentiment and engagement with the content. Positive comments are primarily driven by gratitude and appreciation for the course, with some focus on specific technical topics. Negative comments highlight areas for potential improvement, particularly regarding language barriers, video quality, and ad disruptions. Neutral comments reflect general course discussions, coding-related topics, and user requests for clarification.

These findings suggest that while the content is well-received overall, addressing concerns about accessibility and video experience could enhance viewer satisfaction.

Write File to CSV

In [ ]:
# Save the processed comments to CSV for reuse in later analysis
comments_df.to_csv('comments_df.csv', index=False)

Final Conclusion

By analyzing the topics and sentiment distributions, I gained insights into what drives positive engagement (helpful tutorials and positive feedback), where users are facing challenges (e.g., accents and technical errors), and what general discussions are occurring (e.g., coding and general feedback). These insights can inform content improvements, better targeting of audience needs, and refining user engagement strategies.

Trend Analysis

Over time, positive and neutral sentiments have shown an increasing trend, indicating that the audience's overall engagement and satisfaction with the content are improving. On the other hand, negative sentiment has remained relatively flat, without significant growth, suggesting that while issues exist, they have not worsened over time.
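A monthly pivot of comment counts per sentiment is one simple way to surface this kind of trend. A minimal sketch on toy data, assuming the comments DataFrame carries a publish timestamp and a predicted sentiment label (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; in the project, the real comments DataFrame holds
# each comment's publish timestamp and its predicted sentiment label.
comments = pd.DataFrame({
    "published_at": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18", "2024-03-10"]
    ),
    "sentiment": ["POS", "NEU", "POS", "NEG", "POS"],
})

# Count comments per month and sentiment, then pivot so each sentiment
# becomes a column that can be plotted directly as a trend line.
trend = (
    comments
    .assign(month=comments["published_at"].dt.to_period("M"))
    .groupby(["month", "sentiment"])
    .size()
    .unstack(fill_value=0)
)
print(trend)
```

Plotting the resulting columns (e.g. `trend.plot()`) gives the per-sentiment trend lines described above.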

Next Steps

To further refine content strategy and improve user experience, the following steps can be taken:

  1. Enhancing Accessibility: Since some negative feedback stems from difficulty understanding accents, providing subtitles, transcripts, or AI-generated voiceovers could improve accessibility and comprehension for a wider audience.

  2. Reducing Disruptions: Many negative comments mention ads. Exploring ways to minimize ad interruptions—such as strategically placing them at natural breaks—could lead to improved viewer retention and satisfaction.

  3. Encouraging Constructive Feedback: Given the predominance of positive and neutral engagement, fostering more structured feedback through polls or direct engagement with viewers could offer deeper insights into audience preferences.

  4. Optimizing Technical Explanations: Some comments request clarification on technical topics, indicating the need for supplemental materials like written guides, coding exercises, or additional explanation videos.

The Role of Emojis in Sentiment Analysis

One of the most interesting findings in this analysis was the impact of emoji translation on model performance. Emojis are a fundamental part of social media language, often conveying emotions, reactions, and context that words alone may not capture. Ignoring or removing them can lead to misinterpretations of sentiment.

For example:

  • The comment "Deepseek gonna takeover everything soon 😢" transitioned from Neutral to Negative after the emoji was translated.

  • "Hi, I started this course about a week ago and I am having a bit of trouble with the SQL intermediate phase and would really appreciate some assistance 😢" shifted from Neutral to Negative once the emoji was translated.

  • Comments containing only an emoji, like "❤," transitioned from Neutral to either Positive or Negative after translation.

These examples highlight the importance of accounting for emojis in sentiment analysis. By translating emojis instead of leaving them as-is or removing them, the model correctly interprets the emotional intent behind the comments. Since social media heavily relies on emojis to express tone, sarcasm, and emotion, incorporating them into sentiment analysis is essential for accurate classification.
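The translation step can be sketched with a small hand-made mapping; the project's actual preprocessing uses a full emoji-translation library, so the map and function names below are illustrative only:

```python
# Tiny hand-made emoji-to-alias map for illustration; a real pipeline would
# rely on a complete mapping from an emoji-translation library.
EMOJI_MAP = {
    "😢": ":crying_face:",
    "❤": ":red_heart:",
    "🎉": ":party_popper:",
}

def translate_emojis(text: str) -> str:
    """Replace each known emoji with a text alias the sentiment model can read."""
    for symbol, alias in EMOJI_MAP.items():
        text = text.replace(symbol, alias)
    return text

print(translate_emojis("Deepseek gonna takeover everything soon 😢"))
# → Deepseek gonna takeover everything soon :crying_face:
```

After this step, an emoji-only comment such as "❤" becomes the word token `:red_heart:`, giving the classifier something to score instead of an empty string, which is why such comments moved out of the Neutral class.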

Future Analysis

To further expand this study and gain deeper insights, the following analyses could be conducted:

  • Multilingual Sentiment Analysis: Considering non-English comments by translating them during preprocessing or using multilingual models.

  • Interactive Data Exploration: Developing a Streamlit or Tableau dashboard to allow interactive sentiment and topic analysis.

  • Comprehensive Dataset Expansion: Extending the analysis to cover all videos and comments, including replies for deeper engagement insights.

  • Temporal Sentiment Trends: Analyzing sentiment shifts over time to identify patterns in audience reception and content effectiveness.

  • Engagement Correlation: Exploring the relationship between comment sentiment and key video engagement metrics (likes, shares, watch time) to determine what influences audience interaction.

  • Video-Specific Insights: Performing a per-video sentiment analysis to identify which content resonates best with the audience and why.

By implementing these improvements and analyses, I can further refine content strategy, enhance audience engagement, and ensure that sentiment analysis models accurately reflect the nuances of online discussions.

In [ ]:
#HTML config
sidebar_html = """
<style>
  body {
    margin-left: 220px; /* Make space for the sidebar */
    font-family: Arial, sans-serif;
  }

  .sidebar {
    height: 100%;
    width: 220px;
    position: fixed;
    left: 0;
    top: 0;
    background-color: #2c3e50;
    padding-top: 10px;
    box-shadow: 2px 0px 5px rgba(0, 0, 0, 0.2);
    color: white;
    transition: width 0.3s;
  }

  .sidebar .section {
    padding: 12px 15px;
    font-size: 16px;
    font-weight: bold;
    cursor: pointer;
    background-color: #1a252f;
    border-top: 1px solid #34495e;
    display: flex;
    align-items: center;
    justify-content: space-between;
  }

  .sidebar .section:hover {
    background-color: #34495e;
  }

  .sidebar .section span {
    transition: transform 0.3s ease-in-out;
  }

</style>

<div class="sidebar">
  <div class="section" onclick="scrollToSection('overview')">Overview</div>
  <div class="section" onclick="scrollToSection('api-call')">Fetch data from API</div>
  <div class="section" onclick="scrollToSection('data-cleaning')">Data Cleaning</div>
  <div class="section" onclick="scrollToSection('descriptive-statistics')">Descriptive Statistics</div>
  <div class="section" onclick="scrollToSection('sentiment-analysis')">Sentiment analysis</div>
  <div class="section" onclick="scrollToSection('topic-modeling')">Topic Modeling (BERTopic)</div>
  <div class="section" onclick="scrollToSection('final-conclusion')">Final Conclusion</div>
</div>


<script>
  function scrollToSection(id) {
    var section = document.getElementById(id);
    if (section) {
      section.scrollIntoView({ behavior: "smooth" });
    }
  }
</script>
"""

display(HTML(sidebar_html))